System and method for reduced SSD failure via analysis and machine learning

ABSTRACT

Various implementations described herein relate to systems and methods for predicting and managing drive hazards for Solid State Drive (SSD) devices in a data center, including receiving telemetry data corresponding to SSDs, determining a future hazard of one of those SSDs based on an a-priori model or machine learning, and causing migration of data from that SSD to another SSD.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 15/908,287, filed Feb. 28, 2018, now U.S. Pat. No. 10,635,324, the contents of which are incorporated herein by reference in their entirety.

TECHNICAL FIELD

The present disclosure relates generally to systems and methods for improving reliability of Solid State Drives (SSDs) on a system level.

BACKGROUND

SSDs are advanced components having internal controllers that are capable of advanced services. These services include monitoring, logging, and error handling services that provide visibility to non-volatile memory (for example, NAND flash memory) and Flash Translation Layer (FTL) status. The growing density of NAND flash memory and the shift toward triple level cell (TLC) and quadruple level cell (QLC) using 3D lithography pose new challenges to SSDs. These new challenges include, but are not limited to, accumulation of bad blocks, high error rate, die failure, etc. Failure to address such challenges causes increased failure rates of SSDs. On the other hand, disaggregation and Software Defined Storage (SDS) promote infrastructure-based architecture managed and orchestrated by a central management entity. For example, Intel®'s Rack Scale Design (RSD) supports pod management of multiple storage devices into virtual volumes that bind to hosts on a rack/pod level.

SUMMARY

In certain aspects, the present implementations are directed to systems and methods for improving reliability of SSDs on a system level by reducing failure overhead through predicting drive failure and migrating data to other drives in advance. In some implementations, a central management device (e.g., a pod manager) may collect SSD telemetry information, for example, by sampling telemetry information from various system level drives, each of which is referred to as an SSD device. The central management device may analyze the SSD telemetry information and detect, based on the analysis, drive failures in advance. The central management device may address the predicted drive failure by, for example, migrating data to other drives and/or preventing the predicted drive failure. In some implementations, an a-priori knowledge base can be implemented to enable prediction of future drive failures based on various types of telemetry information. In some implementations, machine learning can be used to predict the future drive failure using prior knowledge of behavior of the SSDs.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows a block diagram of a data center, according to some implementations;

FIG. 2 shows a flow chart of a process for predicting future hazards of a SSD device based on machine learning, according to some implementations;

FIG. 3 shows a flow chart of a process for predicting future hazards of a SSD device based on machine learning, according to some implementations;

FIG. 4 shows a flow chart of a process for predicting future hazards of a SSD device based on an a-priori model, according to some implementations; and

FIG. 5 shows a flow chart of a process for predicting future hazards of a SSD device based on an a-priori model, according to some implementations.

DETAILED DESCRIPTION

Among other aspects, Applicant recognizes that vendors densely pack NAND flash memory in storage solutions to reduce costs. Such packing includes adding more bits per cell, using 3 bits (TLC) or 4 bits (QLC) per cell, adding more bits per package (e.g., using 3D lithography), and using smaller form factors (e.g., M.2, Ruler) to pack more drives in an enclosure. Such measures to increase bit and storage form factor density pose new challenges to SSDs. For instance, the cost-cutting measures mentioned above cause an increase in error rate such that stronger Error Correction Codes (ECCs) are needed. In addition, endurance is reduced due to fewer Program/Erase (P/E) cycles being possible. Furthermore, bad blocks can accumulate faster, thus increasing probability of die failure. Still further, temperature sensitivity can be increased, leading to reduced retention.

Further, Applicant observes that the typical SSD data center architecture is moving away from “closed boxes” (e.g., storage appliances) toward disaggregated environments. The concept of disaggregated environments refers to SSDs being disaggregated from an appliance to form a distributed “pool” of storage in which a single storage appliance's software is replaced with a central management device to manage the distributed pool of storage. The central management device can carve out virtual volumes and allocate those virtual volumes from the pool to individual compute nodes. Currently, such allocation is executed based on capacity and reliability, involving Redundant Array of Independent Disks (RAID) and Quality of Service (QoS).

A SSD device or drive can include a controller for the non-volatile memory, typically NAND flash memory, to implement management of the memory, error correction, FTL, wear leveling, garbage collection, and the like. Due to the complexity of the SSDs, the controllers can collect and maintain various parameters reflective of an internal state of the SSDs. In some implementations, telemetry information with respect to SSDs is gathered. Such telemetry information includes, but is not limited to, endurance information, ECC information, error rate, utilization information, temperature, workload information, and the like.

Accordingly, implementations described herein are directed to a set of rules and algorithms to address technical issues present in data centers utilizing SSDs. The described implementations reduce SSD failure by detecting hazards and addressing those hazards in advance, thus providing predictive solutions. Due to the numerous types of telemetry data, the amount of processing power needed for analytics, and the amount of telemetry data needed to gain critical mass for accurate analytics, computer modeling and machine learning can be leveraged to address deficiencies in non-computerized analytics. Thus, implementations described herein automate a predictive process not previously automated.

To assist in illustrating certain aspects of the present implementations, FIG. 1 shows a block diagram of a data center 100 according to some implementations. The data center 100 includes a SSD pool 105 having a plurality of platforms 110 a-110 n. Each of the platforms 110 a-110 n supports a plurality of SSD devices. For example, the platform 110 a supports SSD devices 112 a, 112 b, . . . , and 112 n. While features of the platform 110 a are described herein for the sake of brevity and clarity, one of ordinary skill in the art can appreciate that the other platforms 110 b-110 n can be similarly implemented.

The platform 110 a may include computer nodes with internal storage, Just a Bunch Of Flash (JBOF) as storage nodes, or both. In some examples, the platform 110 a may correspond to at least one rack or pod populated with a plurality of computing nodes (running applications), a plurality of storage devices (maintaining data), or a combination thereof. The computing nodes can run applications such as Non-Structured Query Language (NoSQL) databases. The storage devices can include Non-Volatile Memory (NVM) devices that maintain data, typically NAND flash memory, but examples of other non-volatile memory technologies include, but are not limited to, Magnetic Random Access Memory (MRAM), Phase Change Memory (PCM), Ferro-Electric RAM (FeRAM), or the like. Regardless of whether the platform 110 a employs a computer nodes architecture, a JBOF architecture, or both, the platform 110 a can serve one or more applications with a given workload using the SSDs 112 a-112 n.

In some implementations, the SSDs 112 a-112 n corresponding to the computing nodes and the storage devices within the platform 110 a are connected to a Top of Rack (TOR) switch 102 and can communicate with each other via the TOR switch 102 or another suitable intra-platform communication mechanism. In some implementations, at least one router 104 may facilitate communications between the SSDs 112 a-112 n (corresponding to the computing devices and/or storage devices) in different platforms. Furthermore, the router 104 can facilitate communication between the platform 110 a (e.g., the SSD devices 112 a-112 n) and the central management device 120.

Each of the SSD devices 112 a-112 n includes a controller 114 a, 114 b, . . . , or 114 n, respectively. Each of the SSD devices 112 a-112 n includes NAND flash memory devices 116 a, 116 b, . . . , or 116 n. Although for the sake of brevity and clarity the SSD device 112 a is described, one of ordinary skill in the art can appreciate that the other SSD devices 112 b-112 n can be likewise implemented.

As shown, the SSD device 112 a includes the controller 114 a and the NAND flash memory devices 116 a. The NAND flash memory devices 116 a are flash memory and include one or more individual NAND flash dies, which are NVMs capable of retaining data without power. Thus, the NAND flash memory devices 116 a refer to a plurality of NAND flash memory devices or dies within the SSD device 112 a. The controller 114 a can combine raw data storage in a plurality of NAND flash memory devices (collectively referred to herein as the NAND flash memory devices 116 a) such that those NAND flash memory devices function like a single drive. The controller 114 a can include microcontrollers, buffers, error correction functionality, flash translation layer (FTL) and flash interface modules for implementing such functions. In that regard, the SSD device 112 a can be referred to as a “drive.”

The controller 114 a includes suitable processing and memory capabilities for executing functions described herein. As described, the controller 114 a manages various features for the NAND flash memory devices 116 a including, but not limited to, I/O handling, reading, writing, erasing, monitoring, logging, error handling, garbage collection, wear leveling, logical to physical address mapping, and the like. Thus, the controller 114 a provides visibility to the NAND flash memory devices 116 a and the FTLs associated therewith. Further, the controller 114 a can collect telemetry information about the NAND flash memory devices 116 a. In some examples, the controller 114 a can gather the telemetry information via high-frequency sampling of the NAND flash memory devices 116 a. Thus, the controller 114 a can serve as a telemetry agent embedded within the SSD device 112 a.

Various types of telemetry information can be collected by the controller 114 a. Examples of the types of the telemetry information collectable by the controller 114 a include, but are not limited to, drive information and data input/output (I/O) workload information. Drive information (also referred to as drive telemetry information or FTL information) includes parameters, data, and values that reflect an internal state of the SSD device 112 a. Such drive information can be gathered from an internal FTL counter operatively coupled to or otherwise associated with the SSD device 112 a. In that regard, the controller 114 a may include the FTL counter. In some implementations, the controller 114 a may further include a thermometer for measuring the temperature of the SSD device 112 a. As such, the drive information can be determined per drive (e.g., per SSD device). Table 1 illustrates examples of the types of drive information collected by the controller 114 a.

TABLE 1

Parameter            Description
Erase count          Average erase count of blocks
Drive write          Data written to SSD device
Media write          Data written to the NAND
Utilization          Data utilization of SSD device
Bad block count      Number of bad blocks in SSD device
BER histogram        Error rate per block
ECC state histogram  Number of blocks in each ECC state (algorithm)
Temperature          SSD device temperature

As shown in Table 1, examples of the types of drive information include, but are not limited to, erase count, drive write, media write, utilization, bad block count, Bit Error Rate (BER) histogram, ECC state histogram, and temperature. Erase count refers to an average erase count, taking into account all blocks in the NAND flash memory devices 116 a. Drive write refers to an amount of data written to the NAND flash memory devices 116 a. Media write refers to an amount of data written to a particular one of the NAND flash memory devices 116 a. Utilization refers to an amount of data storage capabilities of the NAND flash memory devices 116 a that is currently being utilized. Bad block count refers to a number of bad blocks in the NAND flash memory devices 116 a. BER histogram refers to an error rate per block. ECC state histogram refers to a number of blocks in each ECC state or algorithm. Temperature refers to a temperature of the SSD device 112 a.

In some implementations, the drive information can be used to determine various types of drive hazard information of the SSD device 112 a, including but not limited to Write Amplification (WA), drive data utilization, wear-out state, and corollary latency. For example, drive data utilization can be deduced from erase count, bad blocks, and BER. Corollary latency can be deduced from the ECC state. Furthermore, endurance information can be a part of the drive hazard information. Given that endurance of the SSD device 112 a is reduced with fewer P/E cycles, the endurance information can be determined based on the number of Program/Erase (P/E) cycles (i.e., P/E count). In some examples, the number of P/E cycles can be used as a proxy for endurance information.
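For illustration only, the following Python sketch shows one way such derived hazard metrics could be computed from Table 1 style drive information; the field names, the rated P/E cycle figure, and the record layout are hypothetical assumptions, not part of the original disclosure.

```python
# Illustrative sketch only: derives example drive hazard metrics from
# Table 1 style drive information. Field names and ratings are assumptions.

def derive_hazard_metrics(drive_info: dict) -> dict:
    """drive_info is assumed to hold Table 1 style counters for one SSD device."""
    # Write amplification: NAND (media) writes divided by host (drive) writes.
    wa = drive_info["media_write_bytes"] / max(drive_info["drive_write_bytes"], 1)

    # Endurance proxy: fraction of the rated P/E cycles already consumed,
    # using the average erase count as the P/E count.
    rated_pe_cycles = drive_info.get("rated_pe_cycles", 3000)  # assumed TLC rating
    wear_out = drive_info["erase_count_avg"] / rated_pe_cycles

    # Simple utilization/bad-block indicators taken directly from the counters.
    return {
        "write_amplification": wa,
        "wear_out_fraction": wear_out,
        "bad_block_count": drive_info["bad_block_count"],
        "utilization": drive_info["utilization"],
    }
```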

Workload information (also referred to as workload profile) includes parameters, data, and values that reflect current read/write data I/O workload of the SSD device 112 a. Such workload information can be gathered from a data-path module operatively coupled to or otherwise associated with the SSD device 112 a. In that regard, the controller 114 a may include the data-path module. Table 2 illustrates examples of the types of workload information collected by the controller 114 a.

TABLE 2

Parameter        Description
Read I/O count   Counter of read commands
Write I/O count  Counter of write commands
Total write      Total data written to the drive
Total read       Total data read from the drive
Write entropy    H = − Σ_page P(page) · log₂(P(page))   (1)

As shown in Table 2, examples of the types of the workload information include, but are not limited to, read I/O count, write I/O count, total write, total read, and write entropy. The read I/O count refers to a number of read commands executed by the SSD device 112 a. The write I/O count refers to a number of write commands executed by the SSD device 112 a. The total write parameter refers to a total amount of data written to the SSD device 112 a. The total read parameter refers to a total amount of data stored by the SSD device 112 a that had been read. Write entropy (H) refers to a degree of randomness in data stored by the SSD device 112 a. In expression (1), “page” refers to pages in a block of the NAND flash memory devices 116 a.
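As a minimal sketch of expression (1), the Python below computes write entropy from per-page write counts; the input format (a list of counts per page) is an assumption for illustration.

```python
# Illustrative sketch only: computes write entropy H per expression (1) from
# per-page write counts. The input format is an assumption for illustration.
import math

def write_entropy(page_write_counts: list[int]) -> float:
    """H = -sum over pages of P(page) * log2(P(page)), where P(page) is the
    fraction of writes that landed on that page."""
    total = sum(page_write_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in page_write_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# A uniform write pattern over 4 pages yields 2 bits of entropy (high randomness);
# writes concentrated on one page yield 0 bits (sequential or skewed workload).
print(write_entropy([5, 5, 5, 5]))   # 2.0
print(write_entropy([20, 0, 0, 0]))  # 0.0
```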

Such workload information can be used to determine various aspects of the SSD device 112 a including, but not limited to, Input/Output Operations Per Second (IOPS), internal bandwidth, and randomness.

In addition, the controller 114 a may facilitate data migration from the SSD device 112 a to another SSD device (e.g., the SSD device 112 b) by communicating with an associated controller (e.g., the controller 114 b) via the TOR switch 102. Moreover, the controller 114 a may facilitate data migration from the SSD device 112 a to another SSD device in another platform (e.g., the platform 110 b) by communicating with that platform via the router 104.

In some implementations, the platform 110 a may include a platform telemetry agent 118 configured to gather the environment information. The environment information (also referred to as platform telemetry information or environment profile) includes parameters, data, and values that reflect a state of hardware components on the platform 110 a, including the state of an individual SSD device 112 a or chassis (e.g., a JBOF chassis) of the platform 110 a. Such environment information can be gathered by a suitable platform, server, rack, or pod management entity. In that regard, the platform telemetry agent 118 may be or may include the platform, server, rack, or pod management entity. Table 3 illustrates examples of the types of environment information collected by the platform telemetry agent 118.

TABLE 3

Parameter    Description
Temperature  Chassis temperature
Power        Platform power consumption

As shown in Table 3, examples of the environment information include, but are not limited to, temperature and power. The environment information can be collected on a per-SSD basis, per-chassis basis, per-JBOF basis, or per-server basis. Temperature refers to the temperature of a SSD, chassis, JBOF, or server associated with the platform 110 a. Power refers to the power consumption of the SSD, chassis, JBOF, or server associated with the platform 110 a.

The telemetry information is timestamped by the controllers 114 a-114 n and the platform telemetry agent 118. The controllers 114 a-114 n and the platform telemetry agent 118 (as well as those from other platforms 110 b-110 n) are synchronized to a same clock. In some implementations, the telemetry information is sampled periodically (e.g., having a sampling period of 1 hour) to assure detailed information and sufficient sample size.
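As a rough illustration of this sampling scheme, the sketch below packages one telemetry sample with a shared-clock timestamp on a fixed period; the record layout, the sampling-period constant, and the collect_* helpers are hypothetical stand-ins, not part of the disclosure.

```python
# Illustrative sketch only: periodic, timestamped telemetry sampling as described
# above. The record layout and the collect_* helpers are hypothetical.
import time
from dataclasses import dataclass, field

SAMPLING_PERIOD_S = 3600  # e.g., a 1-hour sampling period

@dataclass
class TelemetrySample:
    device_id: str
    timestamp: float          # taken from a clock synchronized across platforms
    drive_info: dict = field(default_factory=dict)        # Table 1 style fields
    workload_info: dict = field(default_factory=dict)     # Table 2 style fields
    environment_info: dict = field(default_factory=dict)  # Table 3 style fields

def sample_once(device_id: str, collect_drive, collect_workload, collect_env) -> TelemetrySample:
    # The three collect_* callables stand in for the controller's FTL counters,
    # the data-path module, and the platform telemetry agent, respectively.
    return TelemetrySample(
        device_id=device_id,
        timestamp=time.time(),
        drive_info=collect_drive(),
        workload_info=collect_workload(),
        environment_info=collect_env(),
    )
```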

The router 104 can further facilitate communications between the SSD pool 105 and a central management device 120. For example, the controllers 114 a-114 n can send the telemetry information (e.g., the drive information and the workload information) to the central management device 120 via the router 104. In addition, the platform telemetry agent 118 can send the telemetry information (e.g., the environment information) to the central management device 120 via the router 104.

The data center 100 further includes the central management device 120. In some implementations, the central management device 120 is a pod manager that manages storage functions for one or more platforms 110 a-110 n. The central management device 120 may possess considerable computing resources to enable detailed processing of raw telemetry information in order to predict future drive failure, among other technical issues concerning the SSD devices in the SSD pool 105. For example, the central management device 120 includes a processing circuit 122 having a processor 124 and a memory 126.

The processor 124 may include any suitable data processing device, such as but not limited to one or more Central Processing Units (CPUs). In some implementations, the processor 124 can be implemented with one or more microprocessors. In some implementations, the processor 124 may include any suitable electronic processor, controller, microcontroller, or state machine. In some implementations, the processor 124 is implemented as a combination of computing devices (e.g., a combination of a Digital Signal Processor (DSP) and a microprocessor, a plurality of microprocessors, at least one microprocessor in conjunction with a DSP core, or any other such configuration). In some implementations, the processor 124 is implemented as one or more Application Specific Integrated Circuits (ASICs), one or more Field Programmable Gate Arrays (FPGAs), one or more DSPs, a group of processing components, or other suitable electronic processing components.

The memory 126 includes a non-transitory processor-readable storage medium that stores processor-executable instructions executable by the processor 124. In some embodiments, the memory 126 includes any suitable internal or external device for storing software and data. Examples of the memory 126 include, but are not limited to, Random Access Memory (RAM), Dynamic RAM (DRAM), Read-Only Memory (ROM), Non-Volatile RAM (NVRAM), flash memory, floppy disks, hard disks, dongles or other Recomp Sensor Board (RSB)-connected memory devices, or the like. The memory 126 can store an Operating System (OS), user application software, and/or executable instructions. The memory 126 can also store application data, such as an array data structure. In some embodiments, the memory 126 stores data and/or computer code for facilitating the various processes described herein.

The central management device 120 may further include a telemetry information database 128 structured to store the telemetry information. Responsive to the telemetry information sampling being complete, the controllers 114 a-114 n and the platform telemetry agent 118 send the telemetry information to the central management device 120. The telemetry data collected with respect to the other platforms 110 b-110 n can be likewise sent to the central management device 120. The telemetry information database 128 stores such telemetry information.

The central management device 120 may provision and assign volumes of SSD devices 112 a-112 n in the platforms 110 a-110 n in some implementations. For instance, the central management device 120 may include a drive assignment module 132 for provisioning and assigning volumes of SSD devices (e.g., the SSD devices 112 a-112 n) in the SSD pool 105. In some implementations, the drive assignment module 132 can reassign volumes based on the predicted drive failure determined by a prediction module 130. This constitutes translating the predicted drive failure into actionable directives to be sent back to the SSD pool 105. The drive assignment module 132 can be implemented with the processing circuit 122 or another suitable processing unit.

The central management device 120 further includes the prediction module 130 that predicts failures or other hazards involving the SSD devices 112 a-112 n in the platforms 110 a-110 n in advance, before the failures and other hazards occur. In some implementations, the prediction module 130 corresponds to suitable Artificial Intelligence (AI) or machine learning apparatus for generating and tuning a drive model or a fail profile, as well as for automatically predicting future failures and other hazards in advance. The prediction module 130 can be implemented with the processing circuit 122 or another suitable processing unit.

In some implementations, the central management device 120 includes or is otherwise coupled to an information output device 140. For example, the information output device 140 includes any suitable device configured to display information, results, alerts, notifications, messages, and the like to an operator concerning the state of the SSD devices in the SSD pool 105. In some implementations, the information output device 140 includes, but is not limited to, a computer monitor, a printer, a facsimile machine, a touchscreen display device, or any other output device performing a similar function.

Various methods can be implemented to predict future hazards using the telemetry information. Examples of such methods include, but are not limited to, a machine-learning-based prediction method and an a-priori-based prediction method.

FIG. 2 illustrates a flow chart of a process 200 for predicting future hazards of a SSD device based on machine learning, according to some implementations. In FIG. 2, blocks with sharp corners denote parameters, data, values, models, and information. Blocks with rounded corners denote processes, operations, and algorithms. Referring generally to the process 200, input parameters/states (e.g., telemetry information 210-230) and drive hazard information 240 are used to derive a drive model 260 or a failure profile using a machine learning process 250 employing AI and machine learning techniques.

As discussed, the platforms 110 a-110 n can periodically sample the telemetry information and send the telemetry information via the router 104 to the central management device 120, to be stored in the telemetry information database 128. For example, drive information 210 and workload information 220 can be gathered and sent by the controllers 114 a-114 n with respect to each of the SSD devices 112 a-112 n, respectively. The drive information 210 and workload information 220 with respect to SSD devices on other platforms 110 b-110 n can be similarly obtained. Environment information 230 with respect to the platform 110 a can be gathered and sent by the platform telemetry agent 118. The environment information 230 with respect to other platforms can be similarly obtained. Such data can be gathered over a long period of time (e.g., a few years) to assure a sufficient sample size for developing the drive model 260.

Drive hazard information 240 refers to actual drive hazards experienced with respect to one or more of the SSD devices 112 a-112 n as the telemetry information 210-230 is being sampled. The drive hazard information 240 can be derived from the telemetry information 210-230 by the prediction module 130. In some examples, the drive hazard information 240 may include actual drive failures detected by the controllers 114 a-114 n or another suitable entity residing on the platform 110 a. The drive hazard information 240 can be stored in the telemetry information database 128. The drive hazard information 240 includes at least a hazard type, a hazard time, and a device indicator. Examples of the hazard type include, but are not limited to, drive failure, a wear-out state, a WA state, a drive data utilization state, and corollary latency. The hazard time is indicative of the moment in time that the hazard information is determined. The device indicator identifies the SSD device for which the hazard information is determined.

The telemetry information 210-230 and the drive hazard information 240 are used as inputs to a machine learning process 250. The machine learning process 250 determines the drive model 260 based on those inputs. In some implementations, the drive model 260 is assembled using data mining techniques. For example, the machine learning process 250 includes determining correlations between the telemetry information 210-230 and the drive hazard information 240. In one example, the machine learning process 250 involves the prediction module 130 identifying correlations between drive temperature (e.g., as a part of the drive information 210) of a SSD device and a wear-out state of the same SSD device. In another example, the machine learning process 250 involves the prediction module 130 identifying correlations between an actual drive failure of a SSD device and erase count, drive write, media write, utilization, bad block count, BER histogram, ECC state histogram, and temperature of the same SSD device. The sample size may include the telemetry data 210-230 and the drive hazard information 240 collected over a long period of time (e.g., a few years). Accordingly, the drive model 260 is developed by learning from the correlation between actual drive hazard information 240 (e.g., actual drive failures) and the telemetry information 210-230.
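One minimal way to realize such a machine learning process 250 is a supervised classifier trained on per-drive telemetry features labeled with observed hazards. The sketch below uses scikit-learn as an assumed dependency; the feature names and hazard labels are hypothetical and not taken from the original text.

```python
# Illustrative sketch only: trains a supervised classifier as one possible
# realization of the machine learning process 250. scikit-learn is an assumed
# dependency; feature names and labels are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per (SSD device, sampling interval): Table 1/2/3 style features.
FEATURES = ["erase_count_avg", "drive_write", "media_write", "utilization",
            "bad_block_count", "mean_ber", "temperature",
            "read_io_count", "write_io_count", "write_entropy"]

def train_drive_model(telemetry_rows: np.ndarray, hazard_labels: np.ndarray):
    """hazard_labels[i] is the hazard (e.g., 'none', 'drive_failure',
    'wear_out') observed for the drive within a horizon after row i."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(telemetry_rows, hazard_labels)
    return model

def predict_drive_state(model, current_telemetry_row: np.ndarray) -> dict:
    # Returns per-hazard probabilities for the current telemetry of one drive.
    return dict(zip(model.classes_, model.predict_proba([current_telemetry_row])[0]))
```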

The drive model 260 having an input/output paradigm can be thusly developed. The prediction module 130 can receive current telemetry information and input the same into the drive model 260. The drive model 260 can output a predicted drive state. The predicted drive state may indicate that a particular SSD device is functioning normally or is expected to encounter a hazard.

With respect to a detected future hazard, the drive model 260 can output a hazard type and an expected hazard occurrence time. In one example, the detected future hazard may be that in two months, a SSD device will have a number of bad blocks exceeding a safe threshold. In another example, the detected future hazard may be that in six months, a SSD device will have an ECC state that causes latency exceeding a latency threshold for a percentage of blocks in that SSD device, where the percentage exceeds a block threshold. In yet another example, the detected future hazard may be that in a month, there is a probability (beyond a probability threshold) that a SSD device will have a BER that can cause read error.
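For illustration, the output of such a model can be carried in a small record pairing the hazard type with the expected occurrence time described above; the structure and field names below are a hypothetical sketch, not the patented format.

```python
# Illustrative sketch only: a hypothetical record for a detected future hazard,
# pairing the hazard type with an expected occurrence time and the device it
# concerns, as in the examples above.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DetectedFutureHazard:
    device_id: str                 # which SSD device the prediction concerns
    hazard_type: str               # e.g., "bad_block_count_exceeds_threshold"
    expected_occurrence: datetime  # when the hazard is expected to materialize
    probability: float             # model confidence in the prediction

# Example: a drive predicted to exceed its safe bad-block threshold in ~2 months.
hazard = DetectedFutureHazard(
    device_id="112a",
    hazard_type="bad_block_count_exceeds_threshold",
    expected_occurrence=datetime.now() + timedelta(days=60),
    probability=0.83,
)
```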

FIG. 3 shows a flow chart of a process 300 for predicting future hazards of a SSD device based on machine learning, according to some implementations. At 310, the prediction module 130 receives the telemetry information 210-230 corresponding to a plurality of SSD devices. The plurality of SSD devices includes the SSD devices 112 a-112 n of the platform 110 a as well as the SSD devices of the other platforms 110 b-110 n.

At 320, the prediction module 130 determines, using the machine learning process 250, the drive model 260 based on the telemetry information. The drive model 260 predicts future drive hazards. In some implementations, block 320 includes the prediction module 130 deriving the drive hazard information 240 based on the telemetry information 210-230. The prediction module 130 derives the drive model 260 through machine learning by correlating the telemetry information 210-230 with the drive hazard information 240. Blocks 310 and 320 are periodically iterated as block 310 is executed periodically based on the sampling period. As new telemetry information is received, the drive model 260 may change due to new information that is inputted into the machine learning process 250. The drive model 260 can thus be updated using machine learning.

At 330, the prediction module 130 receives current telemetry information corresponding to the plurality of SSD devices. The current telemetry information may be the same types of telemetry information as that received at block 310. The current telemetry information received at block 330 can be fed back into the machine learning process 250 (e.g., block 320) to update and tune the drive model 260.

At 340, the prediction module 130 determines whether a future hazard is predicted with respect to a first SSD device (e.g., the SSD device 112 a) of the plurality of SSD devices. Such determination is made based on the drive model 260 and the current telemetry information. For example, the prediction module 130 can use the current telemetry information as inputs to the drive model 260, to determine whether a future hazard will occur. The future hazard includes a hazard type and an expected hazard occurrence time. Responsive to determining that no future hazard is detected (340:NO), the method 300 ends.

On the other hand, responsive to determining that a future hazard is detected with respect to at least a first SSD device (340:YES), the drive assignment module 132 may cause data stored by the first SSD device to migrate to a second SSD device of the plurality of SSD devices. In some examples, the drive assignment module 132 may send a migration command to the router 104, which relays the command to the controller 114 a. The controller 114 a may send the data stored on one or more of the NAND flash memory devices 116 a to another SSD device (e.g., the SSD device 112 b) via the TOR switch 102 in some examples. In some examples, the controller 114 a may send the data stored on one or more of the NAND flash memory devices 116 a to another SSD device on another platform (e.g., the platform 110 b) via the router 104.
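The detect-then-migrate flow at blocks 340/350 could be sketched as below, assuming the hypothetical DetectedFutureHazard record above and a hypothetical send_migration_command helper standing in for the router/TOR path; none of these names come from the original text.

```python
# Illustrative sketch only: dispatches a migration when a future hazard is
# detected for a drive (blocks 340/350). send_migration_command, notify_operator,
# and choose_target_device are hypothetical stand-ins for the router/TOR path,
# the information output device, and the drive assignment module's policy.

def handle_prediction(hazard, ssd_pool, send_migration_command, notify_operator):
    if hazard is None:
        return  # 340:NO - nothing to do, the method ends

    # 340:YES - pick a healthy target drive with enough free capacity.
    target = choose_target_device(ssd_pool, exclude=hazard.device_id)

    # Relay a migration command toward the source drive's controller; the
    # controller then copies data to the target via the TOR switch or router.
    send_migration_command(source=hazard.device_id, target=target)

    # Optionally surface a warning on the information output device 140.
    notify_operator(f"Future hazard {hazard.hazard_type} predicted for "
                    f"{hazard.device_id}; migrating data to {target}.")

def choose_target_device(ssd_pool, exclude):
    # Hypothetical policy: the drive with the most free capacity, excluding
    # the at-risk drive.
    candidates = [d for d in ssd_pool if d["id"] != exclude]
    return max(candidates, key=lambda d: d["free_bytes"])["id"]
```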

In addition, or in the alternative to the automated actions in step 350, the prediction module 130 may be configured to cause the information output device 140 to display warnings corresponding to the future hazard of the first SSD device to an operator or administrator so that the operator or administrator can take appropriate action.

FIG. 4 illustrates a flow chart of a process 400 for predicting future hazards of a SSD device based on an a-priori model, according to some implementations. In FIG. 4, blocks with sharp corners denote parameters, data, values, models, and information. Blocks with rounded corners denote processes, operations, and algorithms. Referring generally to the process 400, an a-priori drive model is used as a starting point to predict a drive state (e.g., a predicted drive state 420) based on the workload information 220. Differences between the predicted drive state 420 and actual drive state (corresponding to the drive information 210) can be used to tune the a-priori drive model and for hazard detection 450. In some implementations, the drive model 260 determined at any moment in time by the prediction module 130 can be used as the a-priori drive model, in a hybrid machine learning and a-priori drive model implementation. Otherwise, the a-priori drive model refers to any suitable model that can determine the predicted drive state 420 using the workload information 220.

In some implementations, the workload information 220 is used as input to block 410 for prediction based on the a-priori drive model, to determine the predicted drive state 420. The predicted drive state 420 includes, in some implementations, the same types of data as the drive information 210. For example, the predicted drive state 420 may include a predicted ECC state histogram or a predicted BER histogram. In some implementations, the predicted drive state 420 includes the same types of data as the drive hazard information. For example, the predicted drive state 420 may include a predicted WA and a predicted P/E count.

The predicted drive state 420 is compared with the drive information 210 received from the SSD pool 105 (and the drive hazard information derived from the drive information 210 by the prediction module 130) at 430. The drive information 210 and the drive hazard information are referred to collectively as the actual drive state. The comparison at 430 yields differences 440 between the predicted drive state 420 and the actual drive state. Such differences 440 can arise from model accuracy, drive variance, environment information 230, and the like. For instance, model accuracy can be determined based on internal elements such as Read Disturb. Drive variance can be determined based on NAND flash memory yields.

The actual drive state and the differences 440 between the actual drive state and the predicted drive state can be used as inputs back to the a-priori drive model for first order tuning of the model. In other words, the actual drive state and the differences 440 can facilitate adapting coefficients present in the a-priori drive model to real world results (e.g., the actual drive state).

The differences 440 can be used for hazard detection 450. For example, when the actual drive state of a SSD device is worse than the predicted drive state 420, it can be assumed that the workload (reflected by the workload information 220) is pushing the SSD device to a failure state. The result of hazard detection 450 can be a detected future hazard, which includes a hazard type and an expected hazard occurrence time. In one example, the detected future hazard may be that in two months, a SSD device will have a number of bad blocks exceeding a safe threshold. In another example, the detected future hazard may be that in six months, a SSD device will have an ECC state that causes latency exceeding a latency threshold for a percentage of blocks in that SSD device, where the percentage exceeds a block threshold. In yet another example, the detected future hazard may be that in a month, there is a probability (beyond a probability threshold) that a SSD device will have a BER that can cause read error.
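A minimal sketch of blocks 430-450 follows, assuming a linear a-priori model whose coefficients map workload counters to predicted drive-state fields; the coefficient layout, learning rate, and hazard ratio are all assumptions for illustration, not the disclosed model.

```python
# Illustrative sketch only: compares predicted vs. actual drive state (430),
# applies a first-order coefficient correction to the a-priori model, and flags
# a hazard when the drive is trending worse than predicted (450). The linear
# model, learning rate, and threshold are assumptions.

def predict_state(coefficients: dict, workload: dict) -> dict:
    # Assumed linear a-priori model: each predicted field is a weighted sum of
    # workload counters (e.g., predicted mean BER driven by total writes).
    return {field: sum(w * workload.get(name, 0.0) for name, w in weights.items())
            for field, weights in coefficients.items()}

def compare_and_tune(coefficients, workload, actual_state,
                     learning_rate=0.05, hazard_ratio=1.2):
    predicted = predict_state(coefficients, workload)
    differences = {f: actual_state[f] - predicted[f] for f in predicted}

    # First-order tuning: nudge each coefficient in proportion to the error
    # and the workload counter that drives it.
    for field, weights in coefficients.items():
        for name in weights:
            weights[name] += learning_rate * differences[field] * workload.get(name, 0.0)

    # Hazard detection: an actual state materially worse than predicted suggests
    # the workload is pushing the drive toward a failure state.
    hazard = any(actual_state[f] > hazard_ratio * predicted[f]
                 for f in predicted if predicted[f] > 0)
    return differences, hazard
```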

Responsive to determining the detected future hazard, the information output device 140 can optionally alert an operator at 460, who can decide whether to take action. The information output device 140 can independently, or under instruction from the operator, instruct the drive assignment module 132 to cause migration of the data stored on the SSD device associated with the detected future hazard to evacuate the data, at 470. Alternatively, responsive to determining the detected future hazard, the drive assignment module 132 can directly cause migration of the data stored on the SSD device associated with the detected future hazard to evacuate the data, at 470.

FIG. 5 illustrates a flow chart of a process 500 for predicting future hazards of a SSD device based on an a-priori model, according to some implementations. At 510, the prediction module 130 receives the workload information 220 corresponding to a plurality of SSD devices. The plurality of SSD devices includes the SSD devices 112 a-112 n of the platform 110 a as well as the SSD devices of the other platforms 110 b-110 n.

At 520, the prediction module 130 determines, based on the a-priori drive model, a predicted drive state 420. At 530, the prediction module 130 determines differences between the predicted drive state 420 and the actual drive state. Block 530 includes the prediction module 130 receiving current drive information 210 corresponding to the plurality of SSD devices. The prediction module 130 can derive the drive hazard information from the drive information 210. The actual drive state includes both the drive information and the drive hazard information.

At 540, the prediction module 130 determines whether a future hazard is predicted with respect to a first SSD device (e.g., the SSD device 112 a) of the plurality of SSD devices. Such determination is made based on at least the differences. In some implementations, such determination is based on both the differences and the actual drive state. The future hazard includes a hazard type and an expected hazard occurrence time. Responsive to determining that no future hazard is detected (540:NO), the method 500 ends.

On the other hand, responsive to determining that a future hazard is detected with respect to at least a first SSD device (540:YES), the drive assignment module 132 may cause data stored by the first SSD device to migrate to a second SSD device of the plurality of SSD devices. In some examples, the drive assignment module 132 may send a migration command to the router 104, which relays the command to the controller 114 a. The controller 114 a may send the data stored on one or more of the NAND flash memory devices 116 a to another SSD device (e.g., the SSD device 112 b) via the TOR switch 102 in some examples. In some examples, the controller 114 a may send the data stored on one or more of the NAND flash memory devices 116 a to another SSD device on another platform (e.g., the platform 110 b) via the router 104.

In addition, or alternatively to the automated actions in step 550, the prediction module 130 may be configured to cause the information output device 140 to display warnings corresponding to the future hazard of the first SSD device to an operator or administrator. The operator or administrator may then take action as required, which may include instructing the drive assignment module via the information output device 140 to perform the migration.

In some implementations, the methods 200 (300) and 400 (500) can be executed concurrently. When the drive model 260 generated by the machine learning process 250 is deemed to be mature, the a-priori drive model used at 410 can be replaced with the drive model 260. In that regard, the rest of the method 400 (500) can remain the same.

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” Unless specifically stated otherwise, the term “some” refers to one or more. All structural and functional equivalents to the elements of the various aspects described throughout the previous description that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed as a means plus function unless the element is expressly recited using the phrase “means for.”

It is understood that the specific order or hierarchy of steps in the processes disclosed is an example of illustrative approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged while remaining within the scope of the previous description. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.

The previous description of the disclosed implementations is provided to enable any person skilled in the art to make or use the disclosed subject matter. Various modifications to these implementations will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other implementations without departing from the spirit or scope of the previous description. Thus, the previous description is not intended to be limited to the implementations shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The various examples illustrated and described are provided merely as examples to illustrate various features of the claims. However, features shown and described with respect to any given example are not necessarily limited to the associated example and may be used or combined with other examples that are shown and described. Further, the claims are not intended to be limited by any one example.

The foregoing method descriptions and the process flow diagrams are provided merely as illustrative examples and are not intended to require or imply that the steps of various examples must be performed in the order presented. As will be appreciated by one of skill in the art, the order of steps in the foregoing examples may be performed in any order. Words such as “thereafter,” “then,” “next,” etc. are not intended to limit the order of the steps; these words are simply used to guide the reader through the description of the methods. Further, any reference to claim elements in the singular, for example, using the articles “a,” “an” or “the” is not to be construed as limiting the element to the singular.

The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the examples disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The hardware used to implement the various illustrative logics, logical blocks, modules, and circuits described in connection with the examples disclosed herein may be implemented or performed with a general purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but, in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Alternatively, some steps or methods may be performed by circuitry that is specific to a given function.

In some exemplary examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a non-transitory computer-readable storage medium or non-transitory processor-readable storage medium. The steps of a method or algorithm disclosed herein may be embodied in a processor-executable software module which may reside on a non-transitory computer-readable or processor-readable storage medium. Non-transitory computer-readable or processor-readable storage media may be any storage media that may be accessed by a computer or a processor. By way of example but not limitation, such non-transitory computer-readable or processor-readable storage media may include RAM, ROM, EEPROM, FLASH memory, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of non-transitory computer-readable and processor-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and/or instructions on a non-transitory processor-readable storage medium and/or computer-readable storage medium, which may be incorporated into a computer program product.

The preceding description of the disclosed examples is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these examples will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to some examples without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the examples shown herein but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.

What is claimed is:
1. A central management device configured to manage a plurality of solid state drive (SSD) devices in a data center, the central management device comprising: a prediction module implemented with a processing circuit having a processor and a memory, the prediction module configured to: identify drive hazard information, wherein the drive hazard information comprises one or both of information about an actual drive failure or hazard information; identify a drive model that is generated using machine learning based on telemetry information and the drive hazard information, wherein the drive model is configured to predict future drive hazards; receive current telemetry information corresponding to the plurality of SSD devices; tune the drive model, using machine learning, based on the current telemetry information corresponding to the plurality of SSD devices; and determine, based on the tuned drive model and the current telemetry information, a future hazard of a first SSD device of the plurality of SSD devices; and a drive assignment module that is configured to provision the plurality of SSD devices based on the determined future hazard.
2. The central management device of claim 1, further comprising a telemetry database that is configured to store the drive hazard information, wherein the prediction module identifies the drive hazard information by accessing the telemetry database.
3. The central management device of claim 1, wherein the telemetry information comprises drive information, data I/O workload information, and environment information.
4. The central management device of claim 3, wherein the drive information comprises at least one of erase count, drive write, media write, utilization, bad block count, Bit Error Rate (BER) histogram, ECC state histogram, and temperature of the plurality of SSD devices.
5. The central management device of claim 3, wherein the data I/O workload information comprises read I/O count, write I/O count, total write, total read, and write entropy of the plurality of SSD devices.
6. The central management device of claim 3, wherein the environment information comprises temperature and power consumption of at least one platform supporting the plurality of SSD devices; the environment information is collected by a platform telemetry agent.

7. The central management device of claim 1, wherein the plurality of SSD devices are supported by two or more platforms; and each of the plurality of SSD devices comprises a controller and one or more NAND flash memory devices.
8. The central management device of claim 7, wherein the controller collects the telemetry information with respect to the plurality of SSD devices; and the telemetry information collected by the controller comprises drive information and workload information.

9. The central management device of claim 7, wherein the controller comprises at least one of a data-path module, a thermometer, or a Flash Translation Layer (FTL) counter.
10. The central management device of claim 1, further comprising an information output device, wherein the prediction module is further configured to cause the information output device to display warnings corresponding to the future hazard of the first SSD device.
11. The central management device of claim 1, wherein the prediction module identifies the drive model by correlating the telemetry information with the drive hazard information.
12. The central management device of claim 1, wherein the provisioning performed by the drive assignment module includes causing data stored by the first SSD device to migrate to the second SSD device automatically responsive to the determination of the future hazard by the prediction module.
13. The central management device of claim 1, wherein the provisioning performed by the drive assignment module includes causing data stored by the first SSD device to migrate to the second SSD device as a result of an alert to a user in connection with the determination of the future hazard by the prediction module.
14. A method for managing a plurality of solid state drive (SSD) devices in a data center, the method comprising: identifying drive hazard information corresponding to the plurality of SSD devices; identifying a drive model that is generated using machine learning based on telemetry information and the drive hazard information, wherein the drive model is configured to predict future drive hazards; receiving current telemetry information corresponding to the plurality of SSD devices; tuning the drive model, using machine learning, based on the current telemetry information corresponding to the plurality of SSD devices; determining, based on the tuned drive model and the current telemetry information, future hazard of a first SSD device of the plurality of SSD devices; and provisioning the plurality of SSD devices based on the determined future hazard.
15. The method of claim 14, wherein provisioning includes: causing data stored by the first SSD device to migrate to a second SSD device of the plurality of SSD devices.
16. The method of claim 15, wherein causing data stored by the first SSD device to migrate to the second SSD device is performed automatically responsive to the determination of the future hazard.
17. The method of claim 15, wherein causing data stored by the first SSD device to migrate to the second SSD device is performed as a result of an alert to a user in connection with the determination of the future hazard.
18. A central management device configured to manage a plurality of solid state drive (SSD) devices in a data center, the central management device comprising: a prediction module implemented with a processing circuit having a processor and a memory, the prediction module configured to: determine, based on an a-priori drive model, data I/O workload information and a current drive state, a predicted drive state for each of the plurality of SSD devices; determine differences between the predicted drive state and an actual drive state; determine, based on the differences, future hazard of a first SSD device of the plurality of SSD devices; and feed back the differences so as to update the a-priori drive model; and a drive assignment module configured to provision the plurality of SSD devices based on the determined future hazard of the first SSD device.
19. The central management device of claim 18, wherein the drive assignment module is configured to perform the provisioning by causing data stored by the first SSD device to migrate to a second SSD device of the plurality of SSD devices.
20. The central management device of claim 19, wherein the drive assignment module is configured to cause data stored by the first SSD device to migrate to the second SSD device automatically responsive to the determination of the future hazard by the prediction module.